A Two-Stage Incremental Annotation Approach to Constructing a Network Informal Language Corpus

Authors

  • Yunqing Xia
  • Kam-Fai Wong
  • Robert Wing Pong Luk
Abstract

Network Informal Language (NIL) refers to the special human language widely used in digital network chat communities, on platforms such as chat rooms and chat tools, mobile phone short message services (SMS), bulletin board systems (BBS), and email. NIL exhibits anomalous characteristics in how it forms words and phrases and in its use of non-alphabetical characters, which makes NIL text difficult to handle with conventional natural language processing (NLP) tools. Previous research reveals that knowledge-based methods perform less effectively on unseen NIL expressions. This motivates us to construct an annotated NIL corpus intended specifically for developing and evaluating techniques for the extraction and normalization of NIL expressions. This paper proposes a two-stage incremental annotation approach for constructing a NIL corpus with minimal human involvement. Several experiments show that this approach greatly improves the efficiency of corpus annotation.

Related papers

Constructing A Chinese Chat Language Corpus with A Two-Stage Incremental Annotation Approach

Chat language refers to the special human language widely used in digital network chat communities. Because chat language exhibits anomalous characteristics in forming words, phrases, and non-alphabetical characters, conventional natural language processing tools are ineffective at handling chat language text. Previous research shows that knowledge-based methods perform less effectively in processin...

Large Multi-lingual, Multi-level and Multi-genre Annotation Corpus

High accuracy in automated translation and information retrieval calls for linguistic annotations at various language levels. The plethora of informal internet content has sparked demand for porting state-of-the-art natural language processing (NLP) applications to new social media as well as for diverse language adaptation. Effort launched by the BOLT (Broad Operational Language Translation) program ...

Annotation Scheme for Constructing Sentiment Corpus in Korean

This paper describes the first year of work constructing the Korean Sentiment Corpus, focusing on the theoretical background such as the annotation scheme. Our aim is to provide a solid theoretical background for the corpus which reflects the characteristics of the Korean language and includes approximately 8,050 sentences taken from news articles. The corpus annotation scheme, based on the MPQ...

Reliable Designing of Capacitated Logistics Network with Multi Configuration Structure under Disruptions: A Hybrid Heuristic Based Sample Average Approximation Algorithm

We consider the reliable multi-configuration capacitated logistics network design problem (RMCLNDP) with system disruptions, which is concerned with locating facilities, constructing transportation links, and allocating their limited capacities to customers in order to satisfy their demands with a minimum expected total cost (including locating costs, link constructing costs, as well as expecte...

Grammar Extraction and Refinement from an HPSG Corpus

Grammar learning and refinement on the basis of language resources is very appealing in comparison with manual development of formal grammar. But in order to learn a complex grammar a complex resource is needed. Thus the creation of language resources and learning of grammars from them have to be aware of each other. In this paper we define a formal basis for annotation of corpora with respect ...

Publication date: 2005